cellula is a simple R-based pipeline
for single cell RNA-seq processing with a number of methods for
integration and identity assignment.
cellula follows the practices outlined in the OSCA book, with
some additional options for integration/batch effect correction
methods.
As a one-stop solution, this package tends to make choices for the
users, with the caveat that these choices follow either defaults or
sensible implementations. However, this means that a certain degree of
freedom is removed from the end user. This assumes that users who desire
total control on the process (or granular specification of parameters)
do not need cellula and would be more comfortable setting
up their own analysis pipelines.
cellula exists to automate and share routine analyses
the way I usually do them, and offer “quick and dirty” access for
exploratory data analysis.
Where does the name “papplain” come from? it’s an inside joke with some friends and ex colleagues.
Why did you change it to “cellula”? one day I’d like to share this tool and I need a name that is not too dumb.
For the time being, clone the repo and install:
devtools::install("path/to/cloned/git")
cellula is dependency-heavy, which is
not something I’m proud of, but makes sense considering this is a
wrapper to a series of different analytical approaches.
The package will require a number of BioConductor and
GitHub dependencies. You can install them as follows:
BiocManager::install(c("scran", "scuttle", "bluster", "scater", "batchelor", "DropletUtils",
"AUCell", "harmony", "GSVA", "gdagstn/oveRlay", "UCell",
"slingshot", "TSCAN"))
The pipelines in cellula assume that you are working
with the output of CellRanger (or something similar) and
you imported it into a SingleCellExperiment object
(hereafter SCE) using DropletUtils::read10Xcounts(). This
is relevant for gene identifiers, since the rowData slot of
the SCE will have a “Symbol” and a “ID” column.
For demo purposes we can use a publicly available dataset,
Segerstolpe et al. 2016 ref, which we
retrieve using the scRNAseq package:
# BiocManager::install("scRNAseq")
sce <- scRNAseq::SegerstolpePancreasData()
colnames(rowData(sce)) = c("Symbol", "ID")
Assuming that you have formed a SCE object containing the “individual” column that identifies different batches, you can run an integration pipeline as follows:
sce <- cellula(sce, name = "myproject", batch = "individual",
integration_method = "Harmony",
verbose = TRUE, save_plots = TRUE)
The name argument defines the name of the folder that
will be created to store files and plots. We set
verbose = TRUE to print the progress of the pipeline.
Setting save_plots = TRUE will create a few QC plots in the
name/plots folder: total UMI, total genes detected, UMI x
genes; optionally % MT, % Ribo and %MALAT1, total UMI x doublet class.
Plots are separated according to whether the cells were discarded or not
in the filtering step.
The cellula() function is a wrapper around a few modules
or sub-pipelines that have different degrees of customization.
The scheme is:
cellula
├── Quality Control [QC]
| ├── run emptyDrops (optional) [QC/EMPTY]
| ├── score mito/ribo/malat1 subsets (optional)
| ├── filter out (optional)
| └── doublet finding (optional) [QC/DBL]
├── Normalization and dimensionality reduction [NOR]
| ├── pre-clustering
| ├── computing pooled factors
| ├── log-normalization (simple or multi-batch)
| ├── HVG finding (simple or multi-batch)
| ├── PCA
| └── UMAP
├── Integration [INT] (optional) - choose one method
| ├── fastMNN [INT/MNN]
| | ├── integration
| | └── UMAP
| ├── Seurat [INT/SEURAT]
| | ├── conversion to Seurat
| | ├── normalization and HVG finding
| | ├── find integration anchors
| | ├── integrate data
| | ├── scale data
| | ├── PCA
| | └── UMAP
| ├── LIGER [INT/LIGER]
| | ├── conversion to LIGER
| | ├── normalization
| | ├── HVG finding
| | ├── scale data
| | ├── NMF
| | ├── quantile normalization
| | └── UMAP
| ├── Harmony [INT/HARMONY]
| | ├── Harmony matrix (on PCA)
| | └── UMAP
| └── Regression [INT/regression]
| ├── regression on logcounts
| ├── PCA
| └── UMAP
└── Cell type annotation [ANNO] (optional) - choose one method
├── Seurat AddModuleScore
├── ssGSEA
├── UCell
└── AUCell
Most of the choices can be made around the integration method.
cellula has implemented 5 methods: fastMNN and
regressBatches from the batchelor package,
Harmony, the CCA-based Seurat method, and
non-negative matrix factorization (NMF) from LIGER through
the rliger package.
LIGER and Seurat integration methods
require an intermediate step where package-specific objects are created
and some pre-processing steps are repeated again according to the best
practices published by the authors of those packages.
Each step of the pipeline can be called independently on the object:
sce <- doQC(sce, name = "segerstolpe", batch = "individual", save_plots = TRUE)
sce <- doNormAndReduce(sce, name = "segerstolpe", batch = "individual")
sce <- integrateSCE(sce, batch = "individual", method = "Seurat")
There are a few simple plotting functions in
cellula:
plot_UMAP() to plot a UMAP (or any other 2D dimensional
reduction)plot_Coldata() to plot data from the
colData slot as a boxplot, scatterplot or confusion
matrixplot_dots() to plot a dot plot of gene expressionplot_UMAP()You can choose point color using the color_by argument,
and facetting is supported via the group_by argument.
Additionally you can choose a shape_by for symbols, and
label_by to place labels on the plot. Note that shape,
group, and labels need to be categorical (i.e. factor) variables,
whereas color can be numeric. The color palette is automatically
generated, but it can be set by the user through the
color_palette argument.
plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "individual", group_by = "disease")
plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "sum")
plot_Coldata()Takes as input x and y as column names from
colData(sce), with an optional color_by and
group_by argument for facetting.
This function returns different plots depending on the class of the 2
colData columns selected: - if y is a numeric
and x is categorical (character or factor), it returns a
combined violin-boxplot with one plot per level of x.
plot_Coldata(sce, x = "individual", y = "sum") + scale_y_log10()
Additionally, if the color_by argument specifies another
column, every x will be divided by levels of
color_by. With the appropriate use of the x,
color_by and group_by variables once an look
at 3 different groupings of y at once.
plot_Coldata(sce, x = "individual", y = "sum", color_by = "disease", group_by = "cell type") + scale_y_log10()
if y and x are both categorical, it
returns a heatmap of the confusion matrix where every value is the
pairwise Jaccard index between sets for any given level pair (this is
mostly useful to check for differences in
clustering/annotations)
if y and x are both numeric, it returns
a scatterplot with an optional 2D kernel density contour plot
overlaid.
plot_Coldata(sce, x = "sum", y = "detected") + scale_x_log10()
plot_dots() - see below.
Since cellula is mostly based on R/Bioconductor
packages, it offers the BiocParallel parallelization
backend through its parallel_param argument. In some cases,
e.g. the Seurat integration, another type of backend has to be set up
separately outside of the function call.
BiocParallel parallelization is implemented where possible, i.e. all the steps of the pipeline where it is sensible to use them (PCA, clustering, integration…).
sce <- integrateSCE(sce, batch = "individual", method = "fastMNN", parallel_param = MulticoreParam(workers = 4))
cellula also offers a wrapper around clustering
functions. For now, only SNN-based Louvain and Leiden clustering are
implemented. The makeGraphsAndClusters() function allows
users to do parameter sweeps along either the number of neighbors or the
resolution of the clustering.
In this example we sweep along the value of the
resolution parameter for a Louvain clustering. For the SNN
graph constructions, edges are weighted according to the jaccard index
of their shared neighbors, mimicking the Seurat graph
construction and clustering procedure:
sce <- makeGraphsAndClusters(sce, k = c(0.1, 0.25, 0.5, 0.75, 1),
space = reducedDim(sce, "PCA_Harmony")[,1:20],
sweep_on = "clustering", method = "louvain",
weighting_scheme = "jaccard", prefix = "SNN_",
verbose = TRUE)
If another integration method has been run on the same object
(e.g. Seurat integration), then the clustering can be performed on that
integrated space by specifying the space argument (in this
case, reducedDim(sce, "PCA_Seurat")):
sce <- makeGraphsAndClusters(sce, k = c(0.1, 0.25, 0.5, 0.75, 1),
space = reducedDim(sce, "PCA_Seurat")
sweep_on = "clustering", method = "louvain",
weighting_scheme = "jaccard", prefix = "Seurat_SNN_",
verbose = TRUE)
The default value for space is NULL and
will use the "PCA" slot from the reducedDim()
accessor.
You can visualize clustering results on the UMAP using the
plot_UMAP() function, adding labels if desired:
plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "SNN_0.5", label_by = "SNN_0.5")
If using clustree, the clustering tree can be visualized
by using the same prefix defined in
makeGraphsAndClusters():
library(clustree)
clustree(sce, prefix = "SNN_")
The default arguments to the clustering wrapper include the
generation of modularity and approximate silhouette scores for every
clustering round. These will be stored in the metadata of
the SCE, named according to the prefix, the resolution, and the
"modularity_" and "silhouette_" prefixes.
Silhouette and modularity can be visualized by using the dedicated
functions:
plotSilhouette(sce, "SNN_0.5")
plotModularity(sce, "SNN_0.5")
You can also use the plot_dots() function to plot the
popular dot-plot for marker genes.
This function takes in a SingleCellExperiment object,
together with a vector of genes (matched to the rownames of
the object), and a grouping variable specified by the
group_by argument. Additionally, dots can be ordered by
hierarchical clustering on either genes, groups, or both (set
cluster_genes and/or cluster_groups to
TRUE, which is the default). Colors can also be customized
via the color_palette argument. Finally, the user can
choose whether they want genes to be columns
(format = "wide", the default) or rows
(format = "tall").
# Quick and dirty marker calculation
markers = presto::wilcoxauc(sce, group_by = "SNN_0.5")
markerlist = split(markers, markers$group)
for(i in seq_len(length(markerlist))) {
markerlist[[i]]$deltapct = markerlist[[i]]$pct_in - markerlist[[i]]$pct_out
markerlist[[i]] = markerlist[[i]][order(markerlist[[i]]$deltapct, decreasing = TRUE),]
}
top5 = lapply(markerlist, function(x) x$feature[1:5])
markergenes = Reduce(union, top5)
plot_dots(sce, genes = top5, group_by = "SNN_0.5")
Aditionally, metaclusters can also be identified. A metacluster is a cluster of clusters obtained by different clustering methods. Clusters across methods are linked acording to how many cells they share, and these links become edges of a graph. Then, Louvain clustering is run on the graph and the communities that are identified are metaclusters. These metaclusters show the relationship between clustering methods. Moreover, they can be used to understand cluster stability along different parameters and/or integration methods. A cell can belong to different clusters according to the clustering method (i.e. to the resolution or to the space that was used). If a cell belongs to clusters that are consistently included in a metacluster, then that cell belongs to a “stable” cluster. If instead the cell belongs to clusters that have different metacluster assignments, then it’s in an “unstable” position, meaning it may be clustered differently according to integration methods and/or resolutions.
The metaClusters() function takes a
clusters argument, which is a vector of column names from
the colData of the SCE where clustering results are stored. In this
example it is easy to isolate by using grep() and searching
for the prefix “SNN_”.
clusterlabels <- colnames(colData(sce))[grep("SNN_", colnames(colData(sce)))]
sce <- metaCluster(sce, clusters = clusterlabels)
Every cell will belong to a series of clusters, which in turn belong
to a metacluster. For every cell, we count how many times they are
assigned to a particular metacluster, and the maximum metacluster is
assigned, together with a “metacluster score” (i.e. the frequency of
assignment to the maximum metacluster) and whether this score is above
or below a certain threshold (0.5 by default). These columns are saved
in the colData slot of the SCE.
plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "metacluster_score", label_by = "SNN_0.5")
cellula implements 4 methods for automated cell identity
assignment, based on the Bioconductor AUCell package, the
GSVA ssGSEA implementation, the
Seurat AddModuleScore() function or the
UCell method.
The function requires a genesets named list containing
genes to be used for scoring every single cell. These can be obtained
through other packages, e.g. msigdbr. For instance, if we
wanted to take all the Muraro et al. signature genes, present in the C8
collection, we would do:
library(msigdbr)
type_genes = msigdbr("Homo sapiens", category = "C8")
genesets = lapply(split(type_genes, type_genes$gs_name), function(x) x$gene_symbol)
muraro_genes = genesets[grep("MURARO", names(genesets))]
Then, we would use the assignIdentities() function from
cellula to calculate signature scores:
sce = assignIdentities(sce,
genesets = muraro_genes,
method = "AUC")
Other methods are “Seurat”, “UCell”, and “ssGSEA”.
This will create a column named “labels_AUC” (or anything else the
user determines using the name argument) in the
colData(sce). Assignments can be plotted:
plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "labels_AUC")
You can now see why it can be useful to plot a confusion matrix as a heatmap:
plot_Coldata(sce, x = "SNN_0.5", y = "labels_AUC")
You can also use single signatures as an input, which will result in
adding the score to the colData slot of the SCE directly,
rather than an assignment:
sce <- assignIdentities(sce,
genesets = muraro_genes$BETA_CELL,
method = "AUC",
name = "Beta_Cell_signature")
plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "Beta_Cell_signature")
The "UCell" method works well when you have small
signatures (e.g. even 2/3 genes). It allows you to specify positive and
negative labels, which is useful when you are sure the identity of a
cell types depends on the lack of expression of certain markers (see
hematopoietic lineages). To do so, you can add “+” or “-” to each
gene.
There are two ways to downsample data in cellula:
downsampling reads and downsampling
cells.
The first approach simulates reads randomly sampling counts from a
distribution with a fixed total number, using a vector of probabilities
equivalent to the per-gene proportion of reads within each cell.
Briefly, let’s consider a cell C in which genes
a, b, and c have been quantified with 50, 30,
20 counts each (totaling to 100 counts). This is equivalent to a bag of
marbles in which the probability of randomly picking an a
marble is 50/100 = 0.5, b marble is 0.3, and c marble
is 0.2. If we want to downsample C to a total of 40
counts (yielding the downsampled C’), we randomly pick
40 counts from a (0.5, 0.3, 0.2) vector of probabilities. This is a sort
of downsampling by simulation and is described in Scott Tyler’s work, reimplemented
in cellula with a slightly faster optimization.
The downsampleCounts() uses a minimum count number that
is user-defined (or the minimum total count number in the dataset as a
default) and returns a SingleCellExperiment object with the
same number of cells as the input, and a down-sampled count matrix where
each cell has the same total number of counts.
The second approach randomly select cells from within groups such as clusters, batches, or a combination of the two. Cells are randomly selected so that they represent a user-defined fraction of the within-group total, with some lower bound to ensure that small groups are represented: if a rare cluster label only contains 9 cells and we want to downsample a dataset to 10%, we can cap the minimum to 5 cells so that we ensure the rare label is still adequately represented.
The downsampleCells() function returns a
SingleCellExperiment object with fewer cells than the
input, as defined by the proportion and min
parameters.
At the time of writing cellula only implements a wrapper
around the slingshot method for pseudotemporal trajectory
inference, and the testPseudotime() method from
TSCAN for differential expression along a trajectory.
The findTrajectories() function takes a
SingleCellExperiment object as input, and requires the user
to specify the cluster label which will be used as an input to the MST
creation in slingshot. Other important parameters are
space - the reduced dimensional reduction in which
trajectories will be estimated (default is “PCA”), start -
the starting cluster for trajectory estimation, defaulting to “auto” for
an entropy-based method, omega - whether or not to use a
synthetic cluster to estimate disjointed trajectories, as detailed in
?slingshot::getLineages, and doDE - whether or
not to perform differential expression on every trajectory.